As alternatives:
1\ Distributions (and mixtures). For example, Poisson if the process is count-based. Markets carry a lot of randomness, so this works poorly, but it is fast to compute
2\ The 2(3)-sigma rule, i.e. Bollinger Bands. They lag, so we learn about the anomaly after the fact
3\ The Holt-Winters model
4\ The ARIMA family
5\ Histogram-Based Outlier Detection (HBOS), a.k.a. outlier analysis
6\ An assortment of Fourier, wavelets, and calculus (practical, vectorizes well, and the limits of applicability are mathematically justified). I don't want to stretch this out and chew through every symbol (the weekend is not infinite). Besides, the pros reading this know all of it anyway
7\ The Machine Learning arsenal. Covered briefly in part two
A historical note: when the Morgan guys kicked off the "black box" technology race in the 80s, the SEC used this kind of approach. Neural networks were not much good back then, and computation was very expensive.
import warnings
import itertools
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels.tsa.seasonal import seasonal_decompose
%matplotlib inline
sns.set()
warnings.filterwarnings("ignore")
data = pd.read_csv('./FILUSDT.csv', parse_dates=['Opened'], index_col='Opened')
# data['Date'] = data['Opened'].dt.date
# data['Time'] = data['Opened'].dt.time
data.head()
| Opened | Open | High | Low | Close | Volume |
|---|---|---|---|---|---|
| 2021-05-12 00:00:00 | 143.015 | 143.873 | 142.886 | 143.832 | 8531.7 |
| 2021-05-12 00:05:00 | 143.832 | 143.876 | 142.722 | 143.002 | 12412.8 |
| 2021-05-12 00:10:00 | 143.013 | 143.074 | 142.508 | 142.699 | 8786.2 |
| 2021-05-12 00:15:00 | 142.699 | 143.294 | 142.624 | 143.245 | 5540.3 |
| 2021-05-12 00:20:00 | 143.233 | 143.745 | 143.233 | 143.715 | 3495.6 |
data.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 46593 entries, 2021-05-12 00:00:00 to 2021-10-20 18:40:00
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Open    46593 non-null  float64
 1   High    46593 non-null  float64
 2   Low     46593 non-null  float64
 3   Close   46593 non-null  float64
 4   Volume  46593 non-null  float64
dtypes: float64(5)
memory usage: 2.1 MB
data.drop(['Open', 'High', 'Low', 'Volume'], axis=1, inplace=True)
data.head()
t = 24*12 # last 24 hours (12 five-minute candles per hour)
data = data[-t:]
data.plot(figsize=(15, 6))
plt.show()
from pylab import rcParams
rcParams['figure.figsize'] = 12, 8
decomposition = seasonal_decompose(data, model='additive', period=12, extrapolate_trend='freq') # period=12 = one hour of 5-minute candles; alternatively model='multiplicative'
fig_d = decomposition.plot()
plt.show()
df_reconstructed = pd.concat([decomposition.seasonal, decomposition.trend, decomposition.resid, decomposition.observed], axis=1)
df_reconstructed.columns = ['season', 'trend', 'resid', 'actual_values']  # names in the same order as the concat above
df_reconstructed.head()
| Opened | season | trend | resid | actual_values |
|---|---|---|---|---|
| 2021-10-19 18:45:00 | 0.013492 | 63.173784 | -0.044276 | 63.143 |
| 2021-10-19 18:50:00 | -0.009724 | 63.157427 | 0.009297 | 63.157 |
| 2021-10-19 18:55:00 | -0.050537 | 63.141071 | -0.050534 | 63.040 |
| 2021-10-19 19:00:00 | -0.046609 | 63.124714 | -0.086105 | 62.992 |
| 2021-10-19 19:05:00 | -0.051419 | 63.108358 | -0.020939 | 63.036 |
df_reconstructed['resid'].describe()
count    288.000000
mean       0.003279
std        0.166765
min       -0.503713
25%       -0.078883
50%       -0.008311
75%        0.061395
max        1.029474
Name: resid, dtype: float64
def get_tresholds(data, nv=3): # nv - band half-width in standard deviations
mean = data.mean()
sd = data.std()
return mean - nv * sd, mean + nv * sd
rcParams['figure.figsize'] = 12, 5
fig, axes = plt.subplots(ncols=2)
rm = df_reconstructed['resid'].rolling(window=12).mean()  # thresholds are estimated from the smoothed residual, so they are tighter than raw 3-sigma
mint, maxt = get_tresholds(rm)
df_reconstructed['resid'].plot(ax=axes[0])
axes[0].axhline(y=maxt, color='r', linestyle=':', lw=2)
axes[0].axhline(y=mint, color='g', linestyle=':', lw=2)
df_reconstructed['resid'].hist(ax=axes[1], bins=20)
rm = df_reconstructed['resid'].rolling(window=12).median()
mint, maxt = get_tresholds(rm)
fig, axes = plt.subplots(ncols=2)
rm.plot(ax=axes[0])
axes[0].axhline(y=maxt, color='r', linestyle=':', lw=2)
axes[0].axhline(y=mint, color='g', linestyle=':', lw=2)
rm.hist(ax=axes[1], bins=20)
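The plots above draw the bands but never actually pull out the flagged timestamps. A self-contained sketch of that last step, on synthetic residuals with two injected outliers (`get_thresholds` just mirrors the `get_tresholds` helper above):

```python
import numpy as np
import pandas as pd

def get_thresholds(data, nv=3):  # nv - band half-width in standard deviations
    mean, sd = data.mean(), data.std()
    return mean - nv * sd, mean + nv * sd

# synthetic residual series with two injected outliers
rng = np.random.default_rng(2)
resid = pd.Series(rng.normal(0, 0.15, 288),
                  index=pd.date_range("2021-10-19", periods=288, freq="5min"))
resid.iloc[100], resid.iloc[200] = 1.2, -1.1

# keep only the points that fall outside the sigma band
mint, maxt = get_thresholds(resid)
anomalies = resid[(resid < mint) | (resid > maxt)]
print(anomalies)
```

The boolean mask `(resid < mint) | (resid > maxt)` is the whole detector; everything before it is just estimating the band.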
# !pip install pycaret --user
data = pd.read_csv('./FILUSDT.csv', parse_dates=['Opened'], index_col='Opened')[-300:]
data['day'] = data.index.day
data['day_name'] = data.index.day_name()
data['hour'] = data.index.hour
data['is_weekday'] = data.index.dayofweek + 1  # ISO weekday (1=Mon..7=Sun), not a boolean
data.drop(['Open', 'High', 'Low', 'Volume'], axis=1, inplace=True)
from pycaret.anomaly import *
s = setup(data, session_id = 123, log_experiment=True)
| | Description | Value |
|---|---|---|
| 0 | session_id | 123 |
| 1 | Original Data | (300, 5) |
| 2 | Missing Values | False |
| 3 | Numeric Features | 2 |
| 4 | Categorical Features | 3 |
| 5 | Ordinal Features | False |
| 6 | High Cardinality Features | False |
| 7 | High Cardinality Method | None |
| 8 | Transformed Data | (300, 8) |
| 9 | CPU Jobs | -1 |
| 10 | Use GPU | False |
| 11 | Log Experiment | True |
| 12 | Experiment Name | anomaly-default-name |
| 13 | USI | 87ba |
| 14 | Imputation Type | simple |
| 15 | Iterative Imputation Iteration | None |
| 16 | Numeric Imputer | mean |
| 17 | Iterative Imputation Numeric Model | None |
| 18 | Categorical Imputer | mode |
| 19 | Iterative Imputation Categorical Model | None |
| 20 | Unknown Categoricals Handling | least_frequent |
| 21 | Normalize | False |
| 22 | Normalize Method | None |
| 23 | Transformation | False |
| 24 | Transformation Method | None |
| 25 | PCA | False |
| 26 | PCA Method | None |
| 27 | PCA Components | None |
| 28 | Ignore Low Variance | False |
| 29 | Combine Rare Levels | False |
| 30 | Rare Level Threshold | None |
| 31 | Numeric Binning | False |
| 32 | Remove Outliers | False |
| 33 | Outliers Threshold | None |
| 34 | Remove Multicollinearity | False |
| 35 | Multicollinearity Threshold | None |
| 36 | Remove Perfect Collinearity | False |
| 37 | Clustering | False |
| 38 | Clustering Iteration | None |
| 39 | Polynomial Features | False |
| 40 | Polynomial Degree | None |
| 41 | Trignometry Features | False |
| 42 | Polynomial Threshold | None |
| 43 | Group Features | False |
| 44 | Feature Selection | False |
| 45 | Feature Selection Method | classic |
| 46 | Features Selection Threshold | None |
| 47 | Feature Interaction | False |
| 48 | Feature Ratio | False |
| 49 | Interaction Threshold | None |
# Available models. We'll go with something simple: Isolation Forest
models()
| ID | Name | Reference |
|---|---|---|
| abod | Angle-base Outlier Detection | pyod.models.abod.ABOD |
| cluster | Clustering-Based Local Outlier | pyod.models.cblof.CBLOF |
| cof | Connectivity-Based Local Outlier | pyod.models.cof.COF |
| iforest | Isolation Forest | pyod.models.iforest.IForest |
| histogram | Histogram-based Outlier Detection | pyod.models.hbos.HBOS |
| knn | K-Nearest Neighbors Detector | pyod.models.knn.KNN |
| lof | Local Outlier Factor | pyod.models.lof.LOF |
| svm | One-class SVM detector | pyod.models.ocsvm.OCSVM |
| pca | Principal Component Analysis | pyod.models.pca.PCA |
| mcd | Minimum Covariance Determinant | pyod.models.mcd.MCD |
| sod | Subspace Outlier Detection | pyod.models.sod.SOD |
| sos | Stochastic Outlier Selection | pyod.models.sos.SOS |
iforest = create_model('iforest', fraction = 0.1)
iforest_results = assign_model(iforest)
iforest_results.head()
| Opened | Close | day | day_name | hour | is_weekday | Anomaly | Anomaly_Score |
|---|---|---|---|---|---|---|---|
| 2021-10-19 17:45:00 | 63.554 | 19 | Tuesday | 17 | 2 | 1 | 0.150182 |
| 2021-10-19 17:50:00 | 63.459 | 19 | Tuesday | 17 | 2 | 1 | 0.114432 |
| 2021-10-19 17:55:00 | 63.359 | 19 | Tuesday | 17 | 2 | 1 | 0.099471 |
| 2021-10-19 18:00:00 | 63.228 | 19 | Tuesday | 18 | 2 | 0 | -0.050734 |
| 2021-10-19 18:05:00 | 63.179 | 19 | Tuesday | 18 | 2 | 0 | -0.050279 |
import plotly.express as px
import plotly.graph_objects as go
# plot value on y-axis and date on x-axis
fig = px.line(iforest_results, x=iforest_results.index, y="Close")
# create list of outlier_dates
outlier_dates = iforest_results[iforest_results['Anomaly'] == 1].index
# obtain y value of anomalies to plot
y_values = [iforest_results.loc[i]['Close'] for i in outlier_dates]
fig.add_trace(go.Scatter(x=outlier_dates, y=y_values, mode = 'markers',
name = 'Anomaly',
marker=dict(color='red',size=10)))
fig.show()
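Under the hood pycaret's `iforest` wraps an Isolation Forest; the same idea can be sketched directly with scikit-learn. The feature frame here is synthetic, and `contamination` stands in for pycaret's `fraction` (an assumption about the mapping: both set the expected share of outliers).

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# synthetic stand-in for the feature frame built above
rng = np.random.default_rng(3)
idx = pd.date_range("2021-10-19", periods=300, freq="5min")
df = pd.DataFrame({"Close": 63 + rng.normal(0, 0.1, 300),
                   "hour": idx.hour}, index=idx)
df.iloc[50, df.columns.get_loc("Close")] = 70  # injected outlier

# contamination ~ the expected share of outliers in the data
clf = IsolationForest(contamination=0.1, random_state=123).fit(df)
df["Anomaly"] = (clf.predict(df) == -1).astype(int)  # 1 = anomaly, as in pycaret's output
print(df[df["Anomaly"] == 1].index)
```

Note that `contamination=0.1` forces roughly 10% of the rows to be flagged regardless of how clean the data is, which is exactly the caveat to keep in mind when picking `fraction` in pycaret.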